Basis Adaptation for Sparse Nonlinear Reinforcement Learning

Authors

  • Sridhar Mahadevan
  • Stephen Giguere
  • Nicholas Jacek
Abstract

This paper presents a new approach to representation discovery in reinforcement learning (RL) using basis adaptation. We introduce a general framework for basis adaptation as nonlinear separable least-squares value function approximation, based on finding Fréchet gradients of an error function using variable projection functionals. We then present a scalable proximal-gradient-based approach to basis adaptation using the recently proposed mirror-descent framework for RL. Unlike traditional temporal-difference (TD) methods for RL, mirror-descent RL methods perform proximal gradient updates of the weights in a dual space, which is linked to the primal space through a Legendre transform involving the gradient of a strongly convex function. Mirror-descent RL can be viewed as a proximal TD algorithm using a Bregman divergence as the distance-generating function. We present a new class of regularized proximal-gradient TD methods, which combine feature selection through sparse L1 regularization with basis adaptation. Experimental results are provided to illustrate and validate the approach.

Introduction

There has been rapidly growing interest in representation discovery for reinforcement learning (Mahadevan 2008). Basis construction algorithms (Mahadevan 2009) combine the learning of features, or basis functions, with control. Basis adaptation (Bertsekas and Yu 2009; Castro and Mannor 2010; Menache, Shimkin, and Mannor 2005) enables tuning a given parametric basis, such as the Fourier basis (Konidaris, Osentoski, and Thomas 2011), radial basis functions (RBFs) (Menache, Shimkin, and Mannor 2005), or polynomial bases (Lagoudakis and Parr 2003), to the geometry of a particular Markov decision process (MDP). Basis selection methods combine sparse feature selection through L1 regularization with traditional least-squares-type RL methods (Kolter and Ng 2009; Johns, Painter-Wakefield, and Parr 2010), linear complementarity methods (Johns, Painter-Wakefield, and Parr 2010), approximate linear programming (Petrik et al. 2010), or convex-concave optimization methods for sparse off-policy TD learning (Liu, Mahadevan, and Liu 2012).

In this paper, we present a new framework for basis adaptation as nonlinear separable least-squares approximation of value functions using variable projection functionals. The framework is adapted from a well-known classical method for nonlinear regression (Golub and Pereyra 1973) in which the model parameters are decomposed into a linear set, fit by ordinary least squares, and a nonlinear set, fit by a Gauss-Newton method that computes the gradient of an error function defined through variable projection functionals. Mirror descent is a highly scalable online convex optimization framework (Nemirovski and Yudin 1983). Online convex optimization (Zinkevich 2003) explores the use of first-order gradient methods for solving convex optimization problems. Mirror descent can be viewed as a first-order proximal-gradient method (Beck and Teboulle 2003) whose distance-generating function is a Bregman divergence (Bregman 1967). We combine basis adaptation with mirror-descent RL (Mahadevan and Liu 2012), a recently developed first-order approach to sparse RL. The proposed approach is validated with experiments showing improved performance compared to previous work.
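The dual-space update described above can be made concrete with a small sketch. The following Python fragment is illustrative only: the names (sparse_md_td_step, alpha, lam) are not taken from the paper, and it uses the p-norm link functions commonly paired with mirror-descent RL rather than the paper's basis-adaptation machinery. It shows a single sparse mirror-descent TD(0) step: the weights are mapped to the dual space by the gradient of a strongly convex function, updated there, sparsified by L1 soft-thresholding, and mapped back to the primal space.

```python
import numpy as np

def p_norm_link(x, r):
    """Gradient of psi(x) = 0.5 * ||x||_r^2 (the distance-generating function);
    with conjugate exponents this maps between primal and dual spaces."""
    norm = np.linalg.norm(x, r)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (r - 1) / norm ** (r - 2)

def soft_threshold(x, lam):
    """Proximal operator of the L1 penalty: shrinks entries toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sparse_md_td_step(theta, phi, phi_next, reward,
                      gamma=0.95, alpha=0.1, lam=0.01, p=None):
    """One sparse mirror-descent TD(0) update (illustrative sketch).

    theta          : current weight vector over basis features
    phi, phi_next  : feature vectors of the current and successor states
    p              : dual-norm exponent; p on the order of log(d) is a common choice
    """
    d = theta.size
    if p is None:
        p = max(2.0, 2.0 * np.log(d))        # heuristic choice of dual exponent
    q = p / (p - 1.0)                         # conjugate exponent: 1/p + 1/q = 1
    delta = reward + gamma * phi_next @ theta - phi @ theta   # TD error
    dual = p_norm_link(theta, q)              # map weights to the dual space (grad psi)
    dual += alpha * delta * phi               # TD(0)-style gradient step in the dual
    dual = soft_threshold(dual, alpha * lam)  # L1 shrinkage performed in the dual space
    return p_norm_link(dual, p)               # map back to the primal space (grad psi*)
```

Because the link function maps zeros to zeros, shrinkage applied in the dual space carries over to sparse primal weights, which is what makes this style of update a natural fit for L1-regularized feature selection.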
Reinforcement Learning

Reinforcement learning can be viewed as a stochastic approximation framework (Borkar 2008) for solving MDPs, which are defined by a set of states S, a set of (possibly state-dependent) actions A (A_s), a dynamical system model comprised of the transition probabilities $P^{a}_{ss'}$ specifying the probability of transitioning to state s′ from state s under action a, and a reward model R specifying the payoffs received. A policy π : S → A is a deterministic mapping from states to actions. Associated with each policy π is a value function V^π, which is a fixed point of the Bellman equation:

$$V^{\pi} = T(V^{\pi}) = R + \gamma P V^{\pi} \qquad (1)$$

where 0 ≤ γ < 1 is a discount factor and T is the Bellman operator. An optimal policy π* is one whose associated value function dominates all others, and is defined by the following nonlinear system of equations:
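The excerpt breaks off at this point. In standard MDP notation, consistent with the definitions above, the nonlinear system characterizing the optimal value function is the familiar Bellman optimality equation (shown here in its standard form, not quoted from the paper):

```latex
V^{*}(s) = \max_{a \in A_s} \Big( R(s,a) + \gamma \sum_{s' \in S} P^{a}_{ss'} \, V^{*}(s') \Big),
\qquad \forall\, s \in S
```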

Similar Articles

Reinforcement Learning with Orthonormal Basis Adaptation Based on Activity-Oriented Index Allocation

An orthonormal basis adaptation method for function approximation was developed and applied to reinforcement learning with multi-dimensional continuous state space. First, a basis used for linear function approximation of a control function is set to an orthonormal basis. Next, basis elements with small activities are replaced with other candidate elements as learning progresses. As this replac...

A Novel Image Denoising Method Based on Incoherent Dictionary Learning and Domain Adaptation Technique

In this paper, a new method for image denoising based on incoherent dictionary learning and domain transfer technique is proposed. The idea of using sparse representation concept is one of the most interesting areas for researchers. The goal of sparse coding is to approximately model the input data as a weighted linear combination of a small number of basis vectors. Two characteristics should b...

Pii: S0378-4754(99)00117-2

A novel approach to adaptive direct neurocontrol is discussed in this paper. The objective is to construct an adaptive control scheme for unknown time-dependent nonlinear plants without using a model of the plant. The proposed approach is neural-network based and combines the self-tuning principle with reinforcement learning. The control scheme consists of a controller, a utility estimator, an ...

A New Nonlinear Reinforcement Scheme for Stochastic Learning Automata

Reinforcement schemes represent the basis of the learning process for stochastic learning automata, generating their learning behavior. An automaton using a reinforcement scheme can decide the best action, based on past actions and environment responses. The aim of this paper is to introduce a new reinforcement scheme for stochastic learning automata. We test our schema and compare with other n...

Basis Function Adaptation in Temporal Difference Reinforcement Learning

We examine methods for on-line optimization of the basis function for temporal difference Reinforcement Learning algorithms. We concentrate on architectures with a linear parameterization of the value function. Our methods optimize the weights of the network while simultaneously adapting the parameters of the basis functions in order to decrease the Bellman approximation error. A gradient-based...

Publication date: 2013